Grammarly, an AI-powered writing assistant tool, enhanced the clarity, coherence, and grammatical accuracy of initial notes and the final draft. Its assistance in identifying errors and suggesting style and vocabulary improvements significantly contributed to the quality of this report.
This project is presented in a single R notebook for demonstration purposes. In a professional setting, a modular structure is recommended to follow best practices and improve maintainability. Data preprocessing, feature engineering, and model training would typically be separated into dedicated scripts or notebooks, with clear management of training, validation, and test datasets to enhance reproducibility and reduce overfitting.
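One possible modular layout (the script names here are purely illustrative) might be:

```
project/
├── data/                     # Raw and processed datasets
├── 01_preprocessing.R        # Cleaning, anomaly handling, data splits
├── 02_feature_engineering.R  # recipes-based transformations
├── 03_model_training.R       # XGBoost training and hyperparameter tuning
└── 04_evaluation.R           # Final evaluation on the held-out test set
```

Each script would read the outputs of the previous stage, making individual steps easier to test, rerun, and review.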
The “17K Mobile Strategy Games” dataset from Kaggle provides comprehensive information about mobile strategy games, including attributes such as average user ratings, price, age rating, app size, release dates, supported languages, and genres. This dataset is an excellent foundation for exploring patterns and relationships that drive user engagement and satisfaction in mobile games. Its diverse range of features allows for an in-depth analysis of the factors influencing user satisfaction.
The initial hypothesis assumes that attributes such as higher number of user ratings, positive sentiment in app descriptions, popular genres, app sizes, and support for multiple languages are associated with higher satisfaction levels, as reflected in Average User Rating. Furthermore, due to the complex relationships between predictors and ratings, user satisfaction is likely influenced by a combination of factors, with the variables interacting in subtle and intricate ways. A secondary hypothesis suggests that apps with high user engagement, as indicated by a larger User Rating Count, tend to have more stable and moderate ratings. However, apps with extreme average ratings (either very high or very low) are likely to exhibit fewer reviews, leading to less reliable averages influenced by limited feedback.
XGBoost was chosen as the modelling algorithm due to its exceptional performance in predictive modelling tasks and its dominance in structured data challenges, particularly in Kaggle competitions. Known for its efficiency and ability to handle complex interactions, XGBoost is well-suited for this analysis, offering a robust platform for both numeric prediction and feature importance analysis, outperforming other methods like deep neural networks on structured data (Lantz, 2023). By leveraging this technique, the report aims to uncover actionable insights and build a reliable model to understand and predict user satisfaction in mobile games.
The original dataset consists of the following features:
- URL: the URL to the app’s page in the App Store
- ID: the app’s assigned ID
- Name: the game’s name
- Subtitle: the secondary text under the name
- Icon URL: the URL to the app’s icon
- Average User Rating: rounded to the nearest 0.5; requires at least 5 ratings
- User Rating Count: number of ratings internationally; null means it is below 5
- Price: price in USD
- In-app Purchases: prices of available in-app purchases
- Description: the app’s description
- Developer: the app developer
- Age Rating: either 4+, 9+, 12+ or 17+
- Languages: ISO2A language codes
- Size: size of the app in bytes
- Primary Genre: the main genre
- Genres: genres of the app
- Original Release Date: when the app was released
- Current Version Release Date: when the app was last updated

The first step in the analysis involves identifying anomalies and preparing the data to ensure its quality and suitability for modelling.
# Load the original dataset
original_data <- read_csv("appstore_games.csv")
# Summary of detected anomalies in the dataset
anomalies <- xray::anomalies(original_data)
knitr::kable(
anomalies$variables,
caption = "Summary of Detected Anomalies in the Original Dataset")

| Variable | q | qNA | pNA | qZero | pZero | qBlank | pBlank | qInf | pInf | qDistinct | type | anomalous_percent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Price | 17007 | 24 | 0.14% | 14212 | 83.57% | 0 | - | 0 | - | 25 | Numeric | 83.71% |
| Subtitle | 17007 | 11746 | 69.07% | 0 | - | 0 | - | 0 | - | 5011 | Character | 69.07% |
| Average User Rating | 17007 | 9446 | 55.54% | 0 | - | 0 | - | 0 | - | 10 | Numeric | 55.54% |
| User Rating Count | 17007 | 9446 | 55.54% | 0 | - | 0 | - | 0 | - | 1793 | Numeric | 55.54% |
| In-app Purchases | 17007 | 9324 | 54.82% | 14 | 0.08% | 0 | - | 0 | - | 3804 | Character | 54.91% |
| Languages | 17007 | 60 | 0.35% | 0 | - | 0 | - | 0 | - | 991 | Character | 0.35% |
| Size | 17007 | 1 | 0.01% | 0 | - | 0 | - | 0 | - | 15795 | Numeric | 0.01% |
| Name | 17007 | 0 | - | 1 | 0.01% | 0 | - | 0 | - | 16847 | Character | 0.01% |
| Age Rating | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 4 | Character | - |
| Primary Genre | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 21 | Character | - |
| Genres | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 1004 | Character | - |
| Current Version Release Date | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 2512 | Character | - |
| Original Release Date | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 3084 | Character | - |
| Developer | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 8693 | Character | - |
| Description | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16473 | Character | - |
| URL | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Character | - |
| ID | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Numeric | - |
| Icon URL | 17007 | 0 | - | 0 | - | 0 | - | 0 | - | 16847 | Character | - |
Anomalies, such as duplicated observations and missing values, should be addressed to maintain the integrity and reliability of the dataset. Observations with missing values for Average User Rating, the target variable in this analysis, will be removed, as null values represent incomplete information that cannot meaningfully contribute to the machine learning process. Similarly, features like URL and ID, which are unique identifiers with no predictive value, will be excluded to reduce noise and prevent unnecessary model complexity.
# Create a clean dataset from the original data
games_data <- original_data |>
distinct() |> # Remove duplicate rows based on all columns
filter(!is.na(`Average User Rating`)) |> # Remove rows with null values in the 'Average User Rating' column
select(-URL, -ID, -`Icon URL`) # Drop unnecessary columns: 'URL', 'ID', and 'Icon URL'

To avoid data leakage and ensure a robust evaluation, the dataset is split into training (70%), validation (15%), and test (15%) subsets using stratified sampling based on the target variable. Stratification preserves the Average User Rating distribution across subsets, ensuring representativeness. The training set serves for data transformations, preprocessing, and exploratory data analysis (EDA), while the validation set will guide hyperparameter tuning. The test set will be held out for final evaluation, ensuring an unbiased assessment of the model on unseen data.
# Load the rsample library for data splitting
library(rsample)
# Set seed for reproducibility
set.seed(123)
# 1. Split the data into training, validation, and test sets
# Initial split: 70% training, 30% for validation + test, stratified by the target variable
split <- initial_split(games_data, prop = 0.7, strata = `Average User Rating`)
# Extract the training dataset
games_train <- training(split)
# Split the remaining 30% into validation and test sets (50/50), stratified by the target variable
valid_test_split <- initial_split(testing(split), prop = 0.5, strata = `Average User Rating`)
# Extract validation and test datasets
games_validation <- training(valid_test_split)
games_test <- testing(valid_test_split)
# 2. Separate features and labels for each dataset
# Training set: features and labels
train_features_unprocessed <- games_train |> select(-`Average User Rating`)
train_labels <- games_train$`Average User Rating`
# Validation set: features and labels
validation_features_unprocessed <- games_validation |> select(-`Average User Rating`)
validation_labels <- games_validation$`Average User Rating`
# Test set: features and labels
test_features_unprocessed <- games_test |> select(-`Average User Rating`)
test_labels <- games_test$`Average User Rating`

Although the mean of the target variable is consistent across subsets, slight variations in standard deviation reflect differences in the distribution of outliers.
# Calculate summary statistics for the datasets
summary_table <- tibble(
Dataset = c("Training", "Validation", "Test"),
Rows = c(nrow(games_train), nrow(games_validation), nrow(games_test)),
Columns = c(ncol(games_train), ncol(games_validation), ncol(games_test)),
`Target Mean` = c(
mean(games_train$`Average User Rating`),
mean(games_validation$`Average User Rating`),
mean(games_test$`Average User Rating`)
),
`Target StdDev` = c(
sd(games_train$`Average User Rating`),
sd(games_validation$`Average User Rating`),
sd(games_test$`Average User Rating`)
)
)
# Print the summary table with a caption
knitr::kable(
summary_table,
caption = "Summary of Training, Validation, and Test Sets") |>
kable_styling(
position = "center",
full_width = FALSE)

| Dataset | Rows | Columns | Target Mean | Target StdDev |
|---|---|---|---|---|
| Training | 5240 | 15 | 4.062977 | 0.7491444 |
| Validation | 1124 | 15 | 4.058274 | 0.7495520 |
| Test | 1124 | 15 | 4.061833 | 0.7584212 |
Figure 1 shows closely aligned distributions across the splits, indicating they are suitable for reliable training and evaluation.
# Combine data for ggplot
plot_split_data <- data.frame(
value = c(
games_train$`Average User Rating`,
games_validation$`Average User Rating`,
games_test$`Average User Rating`
), # Combine target variable values from all splits
split = factor(rep(
c("Training Set", "Validation Set", "Test Set"),
times = c(nrow(games_train), nrow(games_validation), nrow(games_test))
)) # Repeat each label to match its subset's row count
)
# Create an interactive density plot using ggplot2 and ggplotly
ggplotly(
ggplot(plot_split_data, aes(x = value, color = split, fill = split)) +
geom_density(alpha = 0.5) + # Density plot with transparency
labs(
title = "Distributions Across Splits",
x = "Average User Rating",
y = "Density"
) +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top"
) +
scale_fill_brewer(palette = "Blues", direction = -1) + # Custom fill colours from Brewer palette
scale_color_manual(values = c("darkblue", "blue", "royalblue"))) |>
layout(
annotations = list(
text = "Hover over the plot for details", # Add a subtitle
x = 0.5, # Center the subtitle
y = 1.05, # Position above the plot
xref = "paper",
yref = "paper",
showarrow = FALSE, # Disable the arrow for annotation
font = list(size = 12, family = "Aeonik") # Set font style for the annotation
)
)

Figure 1: Distributions of the training, validation, and test sets.
Feature engineering is a critical factor in machine learning models. As emphasised by Kaggle champion Xavier Conort, while algorithms are often standard, the key to success lies in the thoughtful creation and selection of features. By carefully designing features, model accuracy and robustness can be significantly improved, making feature engineering a crucial differentiator between high-performing and average models (Conort, 2013).
Feature engineering was performed to convert non-numeric attributes into numerical values, as required by XGBoost. The following transformations were applied using the recipes package:
- Converting Size to megabytes.
- Converting Age Rating to an integer by removing “+” symbols.
- Replacing Languages with the number of supported languages, assigning a minimum value of 1 for missing entries.
- Encoding Subtitle as 1 for presence, 0 otherwise.
- Calculating market presence as the gap between Current Version Release Date and Original Release Date.
- Converting Original Release Date and Current Version Release Date to years.
- Scoring the sentiment of each Description.
- Creating a binary Pay to Download attribute (1 if Price > 0, otherwise 0).
- Converting In-app Purchases into a binary attribute (1 if > 0, otherwise 0), handling missing values.
- Creating a binary Paid App attribute (1 if Price or In-app Purchases > 0, otherwise 0).
- Flagging awards or nominations mentioned in the Description.
- Creating binary columns for each genre in Genres, skipping duplication with Primary Genre.

# Load necessary libraries
library(recipes) # For creating and managing data preprocessing workflows
library(textrecipes) # For text-specific preprocessing steps
library(stringr) # For string manipulation
library(lubridate) # For working with date and time data
library(syuzhet) # For sentiment analysis
# Create a recipe
games_recipe <- recipe(~ ., data = train_features_unprocessed) |>
# Step 1: Convert 'Size' to megabytes
step_mutate(Size = Size / 1024 / 1024) |>
# Step 2: Remove "+" from Age Rating and transform it to an integer
step_mutate(`Age Rating` = as.integer(str_remove(`Age Rating`, "\\+"))) |>
# Step 3: Transform Languages to the number of supported languages (number of commas + 1, with a minimum of 1)
step_mutate(Languages = as.integer(ifelse(is.na(Languages), 1, str_count(Languages, ",") + 1))) |>
# Step 4: Convert Subtitle into a binary numeric value (1 for presence, 0 for absence)
step_mutate(Subtitle = as.integer(ifelse(is.na(Subtitle), 0, 1))) |>
# Step 5: Calculate market presence (difference between Current and Original Release Dates in years)
step_mutate(
`Original Release Date` = dmy(`Original Release Date`),
`Current Version Release Date` = dmy(`Current Version Release Date`),
`Market Presence` = round(as.numeric(`Current Version Release Date` - `Original Release Date`) / 365.25, 1)) |> # Divide by 365.25 to account for leap years
# Step 6: Convert Original and Current Release Dates to years as integers
step_mutate(
`Original Release Date` = as.integer(year(`Original Release Date`)),
`Current Version Release Date` = as.integer(year(`Current Version Release Date`))) |>
# Step 7: Calculate sentiment analysis score for the Description column
step_mutate(`Description Sentiment` = get_sentiment(`Description`, method = "syuzhet")) |>
# Step 8: Create a binary attribute for Pay to Download (1 if Price > 0, 0 otherwise)
step_mutate(`Pay to Download` = as.integer(ifelse(Price > 0, 1, 0))) |>
# Step 9: Handle missing In-app Purchases values (set missing to 0) and create a binary attribute
step_mutate(`In-app Purchases` = ifelse(is.na(`In-app Purchases`), "0", `In-app Purchases`)) |>
step_mutate(`In-app Purchases` = as.integer(ifelse(`In-app Purchases` == "0", 0, 1))) |>
# Step 10: Create a binary attribute for paid apps (1 if Price > 0 or In-app Purchases > 0, 0 otherwise)
step_mutate(`Paid App` = as.integer(ifelse(Price > 0 | `In-app Purchases` == 1, 1, 0))) |>
# Step 11: Create a binary attribute indicating if the app has awards/nominations in its Description
step_mutate(`Award or Nomination` = as.integer(ifelse(
str_detect(`Description`, regex("award|nominat|prize|recognition", ignore_case = TRUE)),
1, 0))) |>
# Step 12: Handle Genres column by creating binary columns for each genre
step_mutate(Genres = gsub("(\\w+)\\s(\\w+)", "\\1_\\2", Genres)) |> # Replace spaces in multi-word genres with underscores
step_tokenize(Genres) |> # Tokenise the Genres column
step_tf(Genres) |> # Create binary columns for each genre
step_rename_at(all_predictors(), fn = ~ gsub("tf_", "", ., fixed = TRUE)) |> # Remove "tf_" prefix from column names
step_rename_at(all_predictors(), fn = ~ gsub("_", " ", ., fixed = TRUE)) |> # Replace underscores back with spaces
# Step 13: Select only numeric variables (remove non-numeric fields)
step_select(where(is.numeric))
# Prepare the recipe
games_recipe <- prep(games_recipe, training = train_features_unprocessed, retain = TRUE)
# Apply the transformations to the training dataset
train_features <- bake(games_recipe, new_data = train_features_unprocessed)
# Combine preprocessed features with labels for Exploratory Data Analysis (EDA)
train_set <- bind_cols(train_features, `Average User Rating` = train_labels)
# Apply the transformations to the validation dataset
validation_features <- bake(games_recipe, new_data = validation_features_unprocessed)
# Apply the transformations to the test dataset
test_features <- bake(games_recipe, new_data = test_features_unprocessed)

These transformations expanded the training dataset from 15 to 60 variables, capturing a more comprehensive representation of the data.
tibble [5,240 × 60] (S3: tbl_df/tbl/data.frame)
$ Subtitle : int [1:5240] 0 0 0 1 0 0 0 0 0 0 ...
$ User Rating Count : num [1:5240] 284 8376 28 47 35 ...
$ Price : num [1:5240] 1.99 0 2.99 0 0 0 0 0 0 0.99 ...
$ In-app Purchases : int [1:5240] 0 0 0 1 0 0 0 0 0 0 ...
$ Age Rating : int [1:5240] 4 4 4 4 4 4 4 4 4 4 ...
$ Languages : int [1:5240] 1 1 15 1 1 1 1 1 1 1 ...
$ Size : num [1:5240] 11.758 0.644 33.082 46.418 6.035 ...
$ Original Release Date : int [1:5240] 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Current Version Release Date: int [1:5240] 2018 2017 2018 2019 2013 2017 2008 2008 2009 2019 ...
$ Market Presence : num [1:5240] 9.8 9.2 10 10.7 5.3 9.3 0 0.3 0.9 10.7 ...
$ Description Sentiment : num [1:5240] 8.4 3.8 6.95 12.85 1.8 ...
$ Pay to Download : int [1:5240] 1 0 1 0 0 0 0 0 0 1 ...
$ Paid App : int [1:5240] 1 0 1 1 0 0 0 0 0 1 ...
$ Award or Nomination : int [1:5240] 0 0 1 0 0 0 0 0 0 1 ...
$ Genres action : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres adventure : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres board : int [1:5240] 1 1 1 0 0 1 0 0 0 0 ...
$ Genres books : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres business : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres card : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres casino : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres casual : int [1:5240] 0 0 0 0 0 0 1 0 0 0 ...
$ Genres drink : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres education : int [1:5240] 0 0 1 0 0 0 0 0 0 0 ...
$ Genres emoji : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres entertainment : int [1:5240] 0 0 0 1 1 1 0 0 1 0 ...
$ Genres expressions : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres family : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres finance : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres fitness : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres food : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres games : int [1:5240] 1 1 1 1 1 1 1 1 1 1 ...
$ Genres gaming : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres health : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres kids : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres lifestyle : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres magazines : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres medical : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres music : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres navigation : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres news : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres newspapers : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres photo : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres productivity : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres puzzle : int [1:5240] 0 0 0 1 1 0 0 1 0 1 ...
$ Genres racing : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres reference : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres role playing : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres shopping : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres simulation : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres social networking : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres sports : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres stickers : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres strategy : int [1:5240] 1 1 1 1 1 1 1 1 1 1 ...
$ Genres travel : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres trivia : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres utilities : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres video : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Genres word : int [1:5240] 0 0 0 0 0 0 0 0 0 0 ...
$ Average User Rating : num [1:5240] 3.5 3 3.5 3 2.5 2.5 2.5 3.5 3 3.5 ...
The preprocessing and transformations were first applied to the training data and subsequently mirrored on the validation and test sets to maintain consistency and avoid data leakage. By carefully designing features and applying rigorous preprocessing, the dataset was effectively prepared for modelling with XGBoost, providing a solid foundation for exploring patterns and making accurate predictions.
We can now delve into the factors influencing Average User Rating, to provide a comprehensive overview of patterns, relationships, and trends within the data. By examining key variables and their interactions, this analysis seeks to uncover meaningful insights into user satisfaction, laying the groundwork for robust predictive modelling.
While numerous variables could potentially explain the variation in Average User Rating, this analysis is focused on the main predictors. This decision stems from the inherent complexity of relationships between variables, making it challenging to identify all significant contributors manually. Advanced machine learning models like XGBoost are particularly valuable in this context, as they are designed to automatically detect and prioritise the most important variables, even when the relationships are non-linear or subtle.
By leveraging XGBoost, the analysis ensures that important factors are not overlooked due to the limitations of traditional methods or subjective bias. This allows us to streamline the analysis by concentrating on the most influential variables while the algorithm captures complex interactions and dependencies. This methodology enhances the insights’ accuracy and makes the analysis more focused and actionable.
Understanding the distribution of the target variable (Average User Rating) is crucial in machine learning, as it directly influences the performance and reliability of predictive models. Figure 2 illustrates this distribution, segmented by whether the app is free or paid.
# Create the bar plot
ggplot(train_set, aes(x = `Average User Rating`, fill = factor(`Paid App`))) +
scale_x_continuous(breaks = seq(0, 5, by = 0.5)) +
geom_bar(alpha = 0.8, position = "stack", color = "black", width = 0.3) + # Note: 'bins' applies to histograms, not bar plots
labs(
title = "Distribution of Average User Rating",
x = "Average User Rating",
y = "Count",
fill = "Free App?"
) +
scale_fill_manual(values = c("lightblue", "skyblue"), labels = c("Yes", "No")) +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top"
)
Figure 2: Distribution of the target variable (Average User Rating).
The chart reveals a left-skewed distribution, with a pronounced peak in the 4.0 to 4.5 range and relatively few extremely low ratings. Paid apps are prevalent across all rating ranges, representing most apps with both high and low ratings. Free apps, though fewer, are spread more evenly across rating categories and appear to show slightly less variability in average ratings than paid apps (Figure 3).
# Load the ggridges package for creating ridgeline plots
library(ggridges)
# Create the ridgeline plot to visualise the distribution of 'Average User Rating'
ggplot(train_set, aes(x = `Average User Rating`, y = factor(`Paid App`, labels = c("Free App", "Paid App")), fill = factor(`Paid App`))) +
geom_density_ridges(alpha = 0.8, scale = 1) +
scale_x_continuous(breaks = seq(0, 5, by = 0.5)) +
scale_fill_manual(values = c("lightblue", "skyblue")) +
labs(
title = "Distribution of Average User Rating",
x = "Average User Rating",
y = "",
fill = "Free App?"
) +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "none"
)
Figure 3: The ridgeline plot of the Average User Rating distribution split by the app type.
This concentration of ratings in the higher range has significant implications for machine learning modelling. The skewed distribution introduces class imbalance, with most ratings clustered within a narrow range (e.g., 4.0–4.5), potentially making it difficult for models to accurately predict lower or extreme ratings, which are underrepresented in the dataset. Furthermore, the dominance of paid apps suggests that pricing may interact with other features, adding complexity to determining variable importance in the model.
Figure 4 provides a comprehensive overview of the relationships between several potentially influential predictors — User Rating Count, Languages, Size, Age Rating, and Original Release Date — and the Average User Rating.
library(GGally)
# Select top numeric predictors and target variable
ggpairs_data <- train_set |>
select(`Average User Rating`, `User Rating Count`, `Languages`, `Size`, `Age Rating`, `Original Release Date`)
# Pair plot
ggpairs(
ggpairs_data,
aes(fill = `Average User Rating`),
upper = list(continuous = wrap("cor", size = 2.5)),
lower = list(continuous = wrap("points", alpha = 0.6))
) +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top"
) +
labs(title = "Key Predictors and Average User Rating")

Figure 4: The scatterplot matrix of several potential predictors.
The analysis reveals that the number of supported Languages has a weak but statistically significant positive correlation with Average User Rating. Similarly, the Original Release Date shows a moderate positive correlation, indicating that newer apps receive higher ratings, possibly reflecting alignment with current user preferences or trends. Additionally, a weak to moderate positive correlation is observed between Age Rating and Average User Rating, suggesting that as Age Rating increases (indicating content suitable for older audiences), Average User Rating tends to rise slightly. While these relationships are statistically significant, their correlation values indicate that the associations are not particularly strong, suggesting the multifactorial nature of user ratings, with no single variable strongly driving the outcome. This highlights the need for a comprehensive modelling approach capable of capturing the complex interactions between multiple predictors, as these interactions likely influence user ratings in nuanced and non-linear ways.
Figure 5 visualises the relationship between Description Sentiment and Average User Rating, incorporating additional dimensions such as User Rating Count (indicated by bubble size) and Age Rating (represented by colour intensity). This chart provides insights into how the tone of app descriptions correlates with user satisfaction and other factors.
# Bubble plot of Description Sentiment vs. Average User Rating with bubble size as User Rating Count
ggplotly(ggplot(train_set, aes(x = `Description Sentiment`, y = `Average User Rating`, size = `User Rating Count`, color = `Age Rating`)) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "lightblue", high = "darkblue", name = "Age Rating") +
scale_size_continuous(range = c(1, 10), name = "User Rating Count", labels = comma) + # Format legend numbers
scale_y_continuous(breaks = seq(0, 5, 0.5)) +
labs(title = "Description Sentiment vs. Average User Rating",
x = "Description Sentiment",
y = "Average User Rating") +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top"
)) |>
layout(
annotations = list(
text = "Hover over the plot for details",
x = 0.5, # Center the subtitle
y = 1.05, # Position above the chart
xref = "paper",
yref = "paper",
showarrow = FALSE,
font = list(size = 12, family = "Aeonik")
)
)
Figure 5: The relationship between Description Sentiment and Average User Rating. User Rating Count is indicated by bubble size and Age Rating is represented by colour intensity.
The data reveals a weak association between Description Sentiment and Average User Rating, with apps scoring neutral to slightly positive sentiments (sentiment scores around 0 to 10) dominating the rating range. Extreme sentiment scores, whether highly positive or negative, are less common and do not exhibit clear trends of higher or lower ratings. This suggests that while description sentiment may influence user perceptions to some extent, it is not a strong standalone predictor of user ratings.
Bubble sizes indicate substantial variation in user engagement across apps. Larger bubbles, representing apps with higher User Rating Counts, are concentrated around Average User Ratings of 4.0 to 4.5, indicating that apps with higher ratings generally attract more feedback. Notably, one clear outlier with an exceptionally high User Rating Count sits near an Average User Rating of 4.5. Conversely, smaller bubbles, corresponding to apps with fewer ratings, show greater variability in average ratings, which may result from limited feedback and higher volatility.
The colour intensity, representing Age Rating, does not exhibit a strong relationship with either Description Sentiment or Average User Rating. Apps across all age groups are spread throughout the sentiment spectrum, indicating that content appropriateness (as measured by age rating) does not significantly correlate with sentiment or satisfaction.
The plot highlights that while Description Sentiment plays a minor role, user ratings are influenced by a combination of factors. Apps with neutral to moderately positive sentiment and high user engagement tend to achieve higher ratings, but sentiment alone does not provide a strong predictive signal for user satisfaction. This underscores the multifactorial nature of user ratings, requiring a broader consideration of variables to fully understand the drivers of user satisfaction.
Figure 6 illustrates the relationship between Average User Rating and the average User Rating Count across various genres, providing a clear visualisation of top-performing categories in terms of both ratings and user engagement. This analysis offers valuable insights into the genres that resonate most with users, serving as a foundation for selecting relevant genres in predictive models.
library(tidyr) # For pivot_longer
# Create a data subset with avg. user rating per genre
genre_columns <- grep("^Genres", colnames(train_set), value = TRUE)
heatmap_data <- train_set |>
select(all_of(genre_columns), `Average User Rating`, `User Rating Count`) |>
pivot_longer(cols = all_of(genre_columns), names_to = "Genre", values_to = "Genre_Present") |>
filter(Genre_Present == 1) |> # Filter only those rows where the app is present in the genre
mutate(Genre = sub("Genres ", "", Genre)) |>
group_by(Genre) |>
summarize(
`Average User Rating` = mean(`Average User Rating`, na.rm = TRUE),
`Average User Rating Count` = mean(`User Rating Count`, na.rm = TRUE)
) |>
ungroup()
# Plot a circular bar chart using polar coordinates
ggplot(heatmap_data, aes(x = reorder(Genre, -`Average User Rating`), y = `Average User Rating`, fill = `Average User Rating Count`)) +
geom_col() +
coord_polar(theta = "x") +
geom_text(aes(label = round(`Average User Rating`, 1)),
position = position_stack(vjust = 0.95),
size = 2,
color = "white") + # Add labels for Average Rating
labs(
title = "Average User Rating by Genre",
x = NULL,
y = NULL,
fill = "Average User Rating Count" # Legend title
) +
scale_fill_gradient(
low = "lightblue", high = "darkblue",
labels = scales::comma # Format numbers with commas
) +
theme_minimal(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
legend.position = "right",
legend.title = element_text(size = 10),
legend.text = element_text(size = 6),
axis.text.x = element_text(angle = 45, hjust = 2, size = 6),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_line(color = "gray90", size = 0.3),
panel.grid.major.y = element_blank()
)
Figure 6: The relationship between Average User Rating and the average User Rating Count across genres.
High-performing genres, such as Newspapers and Magazines, achieve perfect average ratings (5.0) but have significantly fewer user ratings, indicating that limited feedback may inflate these averages. Conversely, genres like Kids and Reference exhibit very low average ratings and small numbers of user reviews, reflecting less user engagement. In contrast, mid-range genres like Finance and Action display broader user engagement, as indicated by higher rating counts, and maintain more stable average ratings around 4.1.
This pattern highlights an important trend: genres with extreme ratings (either very high or very low) are often associated with fewer reviews, leading to less stable averages influenced by initial or limited feedback. In contrast, genres with higher engagement tend to have more stable, moderate ratings, representing a wider range of user experiences.
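The instability of low-count averages can be illustrated with a small simulation on synthetic ratings (toy values, not the project data): averages computed from a handful of ratings vary far more than averages computed from hundreds.

```r
# Toy simulation: sample "user ratings" from a hypothetical pool and
# compare the spread of average ratings based on 5 vs. 500 ratings per app.
set.seed(123)
ratings_pool <- c(1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
avg_few  <- replicate(2000, mean(sample(ratings_pool, 5,   replace = TRUE)))
avg_many <- replicate(2000, mean(sample(ratings_pool, 500, replace = TRUE)))
# Averages from few ratings are far more volatile (sd scales as 1/sqrt(n))
sd(avg_few) > sd(avg_many)
```

This mirrors the pattern seen across genres: extreme averages tend to come from apps and genres with few reviews.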
These findings underscore the need to consider both Average User Rating and User Rating Count in predictive analysis and strategic decisions. Combining these metrics provides a more comprehensive understanding of user preferences and satisfaction, enabling more accurate and actionable insights.
XGBoost is an excellent choice for predicting Average User Rating because it handles structured data with exceptional efficiency and accuracy. Its gradient boosting framework is particularly effective at capturing complex, non-linear relationships between features and the target variable. Additionally, XGBoost incorporates advanced regularisation techniques to mitigate overfitting, making it highly robust in handling data with variability and noise — common challenges in predicting user satisfaction metrics.
The algorithm’s flexibility in hyperparameter tuning allows for precise optimisation to achieve high performance, further solidifying its reputation as a leading solution for structured data problems. These strengths align perfectly with this analysis’s goals, ensuring accuracy and reliability in the predictive modelling.
Moreover, XGBoost is remarkably insensitive to normalisation, achieving similar accuracy whether or not the data are scaled (Cabello-Solorzano et al., 2023). As a result, normalisation and scaling steps were intentionally omitted from this project, streamlining the workflow without compromising the model’s effectiveness.
We are now prepared to begin building the model. We initialise the xgboost() function with its default parameters. XGBoost requires the data to be converted into a matrix format, as it does not process tibbles or data frames directly.
library(xgboost)
# Convert training, validation, and test sets into a matrix suitable for XGBoost processing
train_dmatrix <- xgb.DMatrix(data = as.matrix(train_features), label = train_labels)
validation_dmatrix <- xgb.DMatrix(data = as.matrix(validation_features))
test_dmatrix <- xgb.DMatrix(data = as.matrix(test_features))
library(tictoc)
# Start timer
tic("Training the model")
# Train an XGBoost model using xgboost() function
xgb_model_base <- xgboost(
data = train_dmatrix, # Training data in DMatrix format
nrounds = 200, # Number of boosting iterations
objective = "reg:squarederror", # Objective function for regression tasks
eval_metric = "rmse", # Evaluation metric: RMSE to evaluate model performance
verbose = 0 # Suppress output during training
)
# Stop timer and display elapsed time
toc()
Training the model: 1.838 sec elapsed
$objective
[1] "reg:squarederror"
$eval_metric
[1] "rmse"
$validate_parameters
[1] TRUE
Once the basic XGBoost model is trained, we generate a SHAP summary plot to analyse the importance of features and their impact on predictions. Figure 7 provides a detailed view of how each feature influences model predictions. Each point represents an individual prediction, with the x-axis indicating the SHAP value, quantifying a feature’s contribution to the prediction. The colour gradient illustrates feature values, with one end representing higher values and the other lower values. Features are ranked by importance, with the most influential at the top. This visualisation reveals the key drivers of predictions and whether their effects are positive or negative, offering valuable insights into feature interactions and model behaviour.
# Generate SHAP summary plot
xgb.ggplot.shap.summary(
model = xgb_model_base,
data = as.matrix(train_features)) +
ggtitle("SHAP Feature Importance of the XGBoost Model") +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5)) +
labs(
x = "",
y = "SHAP value (impact on model output)",
color = "Feature Value"
)
Figure 7: The SHAP summary plot of the baseline XGBoost model.
From the SHAP plot, we observe that Size, User Rating Count, and Description Sentiment are the most influential features affecting Average User Rating. Additionally, features such as recent Original Release Dates and Current Version Dates are positively associated with higher target values, whereas high Market Presence negatively impacts the target variable. These interactions provide insights into how specific features shape predictions and help explain the model’s performance. Notably, four of the most influential features were derived through feature engineering, including Description Sentiment, number of supported Languages, Market Presence-related metrics, and simulation genre. This underscores the critical role of feature engineering in enhancing the model’s predictive power.
The evaluation metrics for the basic XGBoost model indicate reasonable performance and highlight areas for improvement.
# Load necessary library for evaluation metrics
library(Metrics)
# Define the function to calculate evaluation metrics
evaluation_metrics <- function(actual_labels, predicted_labels, train_features) {
# Residual sum of squares (RSS)
rss <- sum((actual_labels - predicted_labels)^2)
# Number of observations and predictors
n <- length(predicted_labels)
p <- ncol(train_features)
# AIC and BIC calculations
aic <- n * log(rss / n) + 2 * p
bic <- n * log(rss / n) + log(n) * p
# Calculate metrics
metrics <- data.frame(
Metric = c("RMSE", "MAE", "MAPE", "R-squared", "AIC", "BIC"),
Value = c(
sqrt(mean((actual_labels - predicted_labels)^2)), # RMSE
mean(abs(actual_labels - predicted_labels)), # MAE
mean(abs((actual_labels - predicted_labels) / actual_labels)) * 100, # MAPE
cor(actual_labels, predicted_labels)^2, # R-squared
aic, # AIC
bic # BIC
)
)
# Return formatted table with the calculated metrics
return(metrics)
}
# Predicted validation values using the trained model
validation_predictions_base <- predict(xgb_model_base, newdata = validation_dmatrix)
# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions_base, train_features) |>
# Convert to a kable table
kable(
col.names = c("Evaluation Metric", "Value"),
caption = "Base Model Performance Metrics",
digits = 4
) |>
kable_styling(
position = "center",
full_width = FALSE
)
| Evaluation Metric | Value |
|---|---|
| RMSE | 0.6988 |
| MAE | 0.5210 |
| MAPE | 15.3379 |
| R-squared | 0.1756 |
| AIC | -687.8212 |
| BIC | -391.3669 |
Metrics like RMSE and MAE reveal moderate prediction errors, suggesting the model captures overall trends but struggles with precise value predictions. The MAPE of 15.34% indicates that predictions deviate from actual values by roughly 15% on average. While the negative AIC and BIC values imply a good balance between model fit and complexity, the low R-squared reflects the challenge of modelling non-linear relationships and high data variability. However, it is important to note that in contexts involving complex data, such as clinical research, even R-squared values above 15% are considered meaningful (Gupta, Stead and Ganti, 2024).
Rounding the target variable in the original dataset to the nearest 0.5 introduces additional challenges in assessing model performance. The model predicts values like 3.16, which reflects its smoothing and estimation process.
# Display the first six predictions from the validation set
cat("First predictions:", head(validation_predictions_base))
First predictions: 4.611531 3.105643 3.374446 3.160298 3.590192 3.501201
Manipulating the predicted variable by rounding can distort the model’s performance metrics and interpretation, leading to misleading results. Rounding reduces the precision of the predictions, which may hide the true accuracy and nuances captured by the model. Instead, using the raw model outputs provides a clearer picture of its performance, allowing for a better evaluation of how well it captures underlying trends in the data.
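A small synthetic illustration of this point (invented data, not the project dataset; the noise level is an assumption for the demo) shows how forcing predictions onto the 0.5 grid changes the measured error:

```r
# Hypothetical demo: compare error metrics for raw vs grid-rounded predictions
set.seed(42)
actual    <- round(runif(1000, 1, 5) * 2) / 2                  # labels on a 0.5 grid, as in the dataset
predicted <- pmin(pmax(actual + rnorm(1000, sd = 0.5), 1), 5)  # noisy continuous predictions
rounded   <- round(predicted * 2) / 2                          # predictions forced onto the 0.5 grid
rmse <- function(a, p) sqrt(mean((a - p)^2))
rmse(actual, predicted)  # error of the raw model output
rmse(actual, rounded)    # error after rounding; the two generally differ
```

The raw output preserves the model's graded confidence between rating levels, which the grid discards.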
Comparing actual and predicted densities in Figure 8, we observe discrepancies, particularly in the first quartile. However, this underperformance in the smaller quartile was anticipated during the exploratory data analysis (EDA), as lower ratings are underrepresented in the dataset. The model’s difficulties in predicting this segment align with expectations based on the observed skewness and data distribution patterns.
# Function to plot density (distribution) for actual vs predicted values
plot_density_distribution <- function(actual_values, predicted_values) {
# Create a data frame with actual and predicted values
data <- data.frame(
value = c(actual_values, predicted_values),
type = rep(c("Actual", "Predicted"), c(length(actual_values), length(predicted_values)))
)
# Plot density curves for both actual and predicted values
ggplot(data, aes(x = value, fill = type, color = type)) +
geom_density(alpha = 0.5, linewidth = 1) + # Create density curves with transparency
scale_fill_manual(values = c("Actual" = "skyblue", "Predicted" = "lightblue")) +
scale_color_manual(values = c("Actual" = "skyblue", "Predicted" = "lightblue")) +
labs(title = "Density Distribution of the Target Variable",
x = "Average User Rating", y = "Density",
fill = "", color = "") + # Axis and legend labels
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "top"
)
}
# Plot density distribution
plot_density_distribution(validation_labels, validation_predictions_base) +
ggtitle("XGBoost Basic Model: Density Distribution of the Target Variable")
Figure 8: Density distributions of the actual values of the target variable from the validation set and the values predicted by the baseline XGBoost model.
The summary statistics confirm that the predicted values differ from the validation set primarily in the lower quartile, highlighting discrepancies in this segment. Additionally, the maximum predicted value exceeds the observed range, indicating an overestimation in some cases.
# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |>
kable(
caption = "Summary Statistics of the Actual Target Variable",
digits = 2
) |>
kable_styling(
position = "center",
full_width = FALSE
)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 1 | 3.5 | 4 | 4.06 | 4.5 | 5 |
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions_base)) |>
kable(
caption = "Summary Statistics of the Predicted Target Variable",
digits = 2) |>
kable_styling(
position = "center",
full_width = FALSE)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 1.97 | 3.81 | 4.13 | 4.05 | 4.38 | 5.17 |
The basic XGBoost model demonstrates solid initial performance, with promising evaluation metrics such as RMSE and MAE that align reasonably well with the variability in the data, as indicated by the standard deviation (0.75). However, further refinements, such as advanced hyperparameter tuning, could enhance its accuracy and improve its ability to predict the target variable more effectively.
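One way to ground these numbers (a sketch, using the standard deviation of 0.75 quoted above): a baseline that predicts the label mean for every app achieves an RMSE equal to the labels’ population standard deviation, so the base model’s validation RMSE of 0.6988 clears that floor only modestly:

```r
# Mean-only baseline: its RMSE equals the population standard deviation of the
# labels, the floor any useful regression model should beat.
baseline_rmse <- function(labels) sqrt(mean((labels - mean(labels))^2))
# Demo with invented labels; on the real validation_labels (sd ≈ 0.75) the
# baseline RMSE would sit near 0.75, against the base model's 0.6988.
demo_labels <- c(3.5, 4, 4.5, 4, 5, 3)
baseline_rmse(demo_labels)
```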
caret
Tuning the basic XGBoost model with the caret package enhances its performance by systematically optimising hyperparameters such as the learning rate, tree depth, and subsampling. These adjustments reduce overfitting, improve predictive accuracy, and better capture the underlying data patterns. The caret package employs techniques such as grid search and cross-validation to explore the hyperparameter space thoroughly, enabling a more robust and fine-tuned model. This approach balances model complexity and performance, resulting in more reliable and accurate predictions.
Given the computational intensity of training XGBoost models with caret, particularly when exploring a large hyperparameter grid, the range of parameters was deliberately limited to optimise efficiency. By narrowing the hyperparameter search space, it was still possible to identify a set of parameters that improved model performance while maintaining computational feasibility. This strategy allowed for significant enhancements in predictive accuracy without overwhelming system resources.
library(caret)
# Start timer
tic("Training the model")
# Define the hyperparameter grid to search over
xgb_grid <- expand.grid(
nrounds = 100,
eta = c(0.03, 0.05, 0.1), # Learning rate (you can explore lower values to avoid overfitting)
gamma = c(0, 1), # Minimum loss reduction for split
max_depth = c(6, 7), # Depth of trees (shallower trees often prevent overfitting)
min_child_weight = c(2, 3), # Minimum number of data points in a child node (to control splits)
subsample = c(0.9, 1), # Fraction of data used for training in each iteration
colsample_bytree = c(0.9, 1) # Fraction of features used for training in each iteration
)
# Set up cross-validation control
train_control <- caret::trainControl(
method = "cv", # k-fold cross-validation
number = 3, # Number of folds (3-fold CV to limit computation)
verboseIter = FALSE, # Do not display progress
search = "grid", # Grid search method for hyperparameter tuning
summaryFunction = defaultSummary, # RMSE will be used by default
allowParallel = TRUE
)
# Train the XGBoost model using caret
set.seed(123) # Set seed for reproducibility
xgb_caret_model <- caret::train(
x = train_features,
y = train_labels,
method = "xgbTree", # XGBoost method in caret
trControl = train_control, # Cross-validation settings
tuneGrid = xgb_grid # Grid of hyperparameters to search
)
# Stop timer and display elapsed time
toc()
Training the model: 191.085 sec elapsed
| | nrounds | max_depth | eta | gamma | colsample_bytree | min_child_weight | subsample |
|---|---|---|---|---|---|---|---|
| 43 | 100 | 6 | 0.05 | 1 | 0.9 | 3 | 0.9 |
As visualised in Figure 9, the tuned model’s predictions show a greater concentration of values between the 1st and 3rd quartiles, reflecting an improvement in capturing the target variable’s distribution.
# Predicted validation values using the trained model
validation_predictions_caret <- predict(xgb_caret_model, newdata = validation_features)
# Plot the density distribution for actual vs predicted values
plot_density_distribution(validation_labels, validation_predictions_caret) +
ggtitle ("`caret`-Tuned XGBoost Model: Density Distribution of the Target Variable")
Figure 9: Density distributions of the actual values of the target variable from the validation set and the values predicted by the XGBoost model tuned with caret.
# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |>
kable(
caption = "Summary Statistics of the Actual Target Variable",
digits = 2) |>
kable_styling(
position = "center",
full_width = FALSE)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 1 | 3.5 | 4 | 4.06 | 4.5 | 5 |
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions_caret)) |>
kable(
caption = "Summary Statistics of the Predicted Target Variable",
digits = 2) |>
kable_styling(
position = "center",
full_width = FALSE)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 2.74 | 3.84 | 4.07 | 4.03 | 4.29 | 4.7 |
Evaluation metrics demonstrate a noticeable performance boost over the base model, with a more than 30% improvement in BIC and meaningful gains in R-squared. These improvements are particularly significant for complex datasets, where even small increases in model performance can provide valuable insights.
# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions_caret, train_features) |>
# display the summary statistics in a table format
kable(
col.names = c("Evaluation Metric", "Value"),
caption = "Caret-Tuned Model Performance Metrics",
digits = 4) |>
kable_styling(
position = "center",
full_width = FALSE)
| Evaluation Metric | Value |
|---|---|
| RMSE | 0.6627 |
| MAE | 0.4978 |
| MAPE | 14.8479 |
| R-squared | 0.2195 |
| AIC | -806.7621 |
| BIC | -510.3078 |
While tuning with caret significantly enhanced model accuracy and robustness, it remains computationally intensive, particularly with larger hyperparameter grids. This highlights the importance of balancing hyperparameter optimisation with computational efficiency to achieve reliable results without exceeding system capabilities.
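One practical safeguard is to count the model fits a grid implies before launching it. Re-stating the grid defined above, expand.grid() shows 96 combinations, i.e. 288 fits under 3-fold cross-validation:

```r
# Count hyperparameter combinations in the caret grid before training:
# 1 * 3 * 2 * 2 * 2 * 2 * 2 = 96 combinations
grid_size <- nrow(expand.grid(
  nrounds = 100,
  eta = c(0.03, 0.05, 0.1),
  gamma = c(0, 1),
  max_depth = c(6, 7),
  min_child_weight = c(2, 3),
  subsample = c(0.9, 1),
  colsample_bytree = c(0.9, 1)
))
n_folds <- 3
c(combinations = grid_size, total_fits = grid_size * n_folds)  # 96 and 288
```

Doubling the levels of each parameter would multiply this count by 2^6, which is why the search space was deliberately narrowed.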
xgb.train()
The xgb.train() function from the xgboost package offers greater efficiency and flexibility than xgboost(), which was used to train the basic model. It supports advanced features such as early stopping, fine-grained control over training parameters, and enhanced handling of evaluation metrics. These capabilities make xgb.train() a more tailored and faster solution for gradient boosting tasks, especially compared with the computationally intensive tuning process in the caret package.
This model’s parameters were set based on the best-performing hyperparameters identified during tuning with caret. This approach leverages the exhaustive grid search and cross-validation performed by caret to establish a strong baseline for further refinement using xgb.train(). By adopting these pre-optimised parameters, the xgb.train() model benefits from the insights gained during the more intensive caret tuning process while achieving greater computational efficiency.
# Set a seed for reproducibility
set.seed(123)
# Start timer
tic("Training the model")
# Define the parameter list for XGBoost
param <- list(
objective = "reg:squarederror", # Objective function for regression
eval_metric = "rmse", # Evaluation metric: Root Mean Squared Error
eta = 0.05, # Lower learning rate to prevent overfitting
max_depth = 6, # Shallower trees to reduce complexity
min_child_weight = 3, # Conservative splits to control overfitting
subsample = 0.9, # Randomly sample 90% of data
colsample_bytree = 1, # Use all features for training
lambda = 1 # L2 regularisation to prevent overfitting
)
# Perform cross-validation to tune hyperparameters
cv_results <- xgb.cv(
params = param,
data = train_dmatrix,
nrounds = 200, # Number of boosting iterations
nfold = 10, # 10-fold cross-validation
early_stopping_rounds = 10, # Stop if no improvement for 10 rounds
verbose = FALSE # Do not print results for each fold
)
# Train the model with the best parameters found during cross-validation
xgb_model <- xgb.train(
params = param, # Parameters for XGBoost
data = train_dmatrix,
nrounds = cv_results$best_iteration # Number of boosting rounds chosen by cross-validation
)
# Stop timer and display elapsed time
toc()
Training the model: 16.333 sec elapsed
##### xgb.Booster
raw: 465.1 Kb
call:
xgb.train(params = param, data = train_dmatrix, nrounds = cv_results$best_iteration)
params (as set within xgb.train):
objective = "reg:squarederror", eval_metric = "rmse", eta = "0.05", max_depth = "6", min_child_weight = "3", subsample = "0.9", colsample_bytree = "1", lambda = "1", validate_parameters = "TRUE"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
# of features: 59
niter: 133
nfeatures : 59
Evaluating the performance of this trained model on the validation set highlights the impact of tuning on predictive power.
# Predicted validation values using the trained model
validation_predictions <- predict(xgb_model, newdata = validation_dmatrix)
# Calculate evaluation metrics
evaluation_metrics(validation_labels, validation_predictions, train_features) |>
kable(
col.names = c("Evaluation Metric", "Value"),
caption = "Final Model Performance Metrics",
digits = 4) |>
kable_styling(
position = "center",
full_width = FALSE)
| Evaluation Metric | Value |
|---|---|
| RMSE | 0.6589 |
| MAE | 0.4915 |
| MAPE | 14.7264 |
| R-squared | 0.2265 |
| AIC | -819.6650 |
| BIC | -523.2107 |
The increase in R-squared from 0.2195 to 0.2265 indicates improved model performance in explaining variance within the data. Additionally, metrics such as RMSE, MAE, and MAPE show noticeable improvements, confirming that the model’s predictions are closer to the actual values. The decrease in AIC and BIC further validates that the tuned model achieves a better balance between complexity and fit, resulting in a more accurate representation of the underlying data patterns. Importantly, tuning with xgb.train() proved more computationally efficient than using the caret package, offering a streamlined approach to optimisation.
Figure 10, comparing actual and predicted values of the target variable, shows that the tuned model’s predictions exhibit a density distribution similar to that of the model tuned with caret, visually confirming its accuracy.
# Generate summary statistics for the actual target variable
glance(summary(validation_labels)) |>
kable(
caption = "Summary Statistics of the Actual Target Variable",
digits = 2
) |>
kable_styling(
position = "center",
full_width = FALSE
)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 1 | 3.5 | 4 | 4.06 | 4.5 | 5 |
# Generate summary statistics for the predicted target variable
glance(summary(validation_predictions)) |>
kable(
caption = "Summary Statistics of the Predicted Target Variable",
digits = 2) |>
kable_styling(
position = "center",
full_width = FALSE)
| minimum | q1 | median | mean | q3 | maximum |
|---|---|---|---|---|---|
| 2.7 | 3.86 | 4.1 | 4.05 | 4.31 | 4.77 |
# Plot the density distribution for actual vs predicted values
plot_density_distribution(validation_labels, validation_predictions) +
ggtitle ("XGBoost `xgb.train` Model: Density Distribution of the Target Variable")
Figure 10: Density distributions of the actual values of the target variable from the validation set and the values predicted by the XGBoost model using xgb.train().
Reviewing the importance of features in this model in Figure 11 reveals notable changes compared to the base model. User Rating Count has emerged as the most significant driver of Average User Rating, surpassing app Size in influence, while Description Sentiment no longer ranks among the top three features.
# Generate SHAP summary plot
xgb.ggplot.shap.summary(
model = xgb_model,
data = as.matrix(train_features)) +
ggtitle("SHAP Feature Importance of the XGBoost Model") +
theme_tufte(base_family = "Aeonik") +
theme(
plot.title = element_text(hjust = 0.5)) +
labs(
x = "",
y = "SHAP value (impact on model output)",
color = "Feature Value"
)
Figure 11: The SHAP summary plot of the XGBoost model using xgb.train().
These results underscore the benefits of tuning with xgb.train(), not only in terms of computational efficiency but also in achieving a refined understanding of feature importance and enhancing the model’s predictive performance.
As the final model demonstrated the best evaluation metrics on the validation data, applying it to the test dataset is essential for assessing its performance comprehensively. This step evaluates the model’s ability to generalise to unseen data, ensuring that the improvements observed during validation translate effectively to real-world scenarios.
Predictions on the test data across all three models revealed a slight decline in performance compared to the validation set, suggesting that the models fit the test data slightly less effectively.
# Predict test values using the base XGBoost model
test_predictions_base <- predict(xgb_model_base, newdata = test_dmatrix)
# Predict test values using the caret-tuned XGBoost model
test_predictions_caret <- predict(xgb_caret_model, newdata = test_features)
# Predict test values using the final XGBoost model
test_predictions <- predict(xgb_model, newdata = test_dmatrix)
# Convert data frames to tables using kable for evaluation metrics
kable(evaluation_metrics(test_labels, test_predictions_base, train_features), caption = "Base Model's Evaluation on Test Data") |>
kable_styling(
position = "center",
full_width = FALSE)
| Metric | Value |
|---|---|
| RMSE | 0.7393065 |
| MAE | 0.5400284 |
| MAPE | 16.1484475 |
| R-squared | 0.1244816 |
| AIC | -560.9920841 |
| BIC | -264.5377913 |
kable(evaluation_metrics(test_labels, test_predictions_caret, train_features), caption = "Caret-Tuned Model's Evaluation on Test Data") |>
kable_styling(
position = "center",
full_width = FALSE)
| Metric | Value |
|---|---|
| RMSE | 0.6844488 |
| MAE | 0.5078527 |
| MAPE | 15.2016042 |
| R-squared | 0.1863303 |
| AIC | -734.3097936 |
| BIC | -437.8555008 |
kable(evaluation_metrics(test_labels, test_predictions, train_features), caption = "Final Model's Evaluation on Test Data") |>
kable_styling(
position = "center",
full_width = FALSE)
| Metric | Value |
|---|---|
| RMSE | 0.6861636 |
| MAE | 0.5053039 |
| MAPE | 15.2287369 |
| R-squared | 0.1831256 |
| AIC | -728.6848398 |
| BIC | -432.2305470 |
However, the evaluation metrics remain within acceptable ranges, indicating that all models generalise reasonably well to new data. Among them, the caret-tuned model performed slightly better on the test data than the final XGBoost model, though the differences are minor. The trade-off lies in computational efficiency: while the caret-tuned model achieved marginally better results, it required significantly more time and resources during training and tuning. In contrast, the final model offers a practical balance, delivering robust performance with greater computational efficiency, making it an attractive choice for resource-constrained applications.
The slight decline in performance on the test data can be attributed to nuanced differences in patterns that were not fully captured during training and validation, despite stratified sampling. This is common in machine learning, particularly with complex models like XGBoost, which are sensitive to subtle variations in data. Additionally, the tuning process likely optimised performance for the validation set, introducing minor overfitting to its specific characteristics.
Nevertheless, the model’s consistent performance across datasets demonstrates its ability to capture overall trends effectively. These results emphasise the importance of stratified sampling in maintaining dataset consistency and achieving a balance between model complexity and predictive performance. To further enhance robustness and generalisability, future efforts should incorporate additional cross-validation techniques to capture variability across datasets better and minimise overfitting. Regularisation adjustments and exploring advanced tuning strategies, such as automated hyperparameter optimisation or ensembling with complementary algorithms, could provide further improvements and make the model even more resilient to variations in unseen data.
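One of the suggested extensions, ensembling, can be sketched as an RMSE-weighted average of the three models’ predictions. The RMSE values come from the validation tables above; the prediction vectors here are invented stand-ins for test_predictions_base, test_predictions_caret, and test_predictions:

```r
# Weight each model inversely to its validation RMSE so stronger models
# contribute more to the ensemble.
rmse_vals <- c(base = 0.6988, caret = 0.6627, final = 0.6589)
w <- (1 / rmse_vals) / sum(1 / rmse_vals)   # weights sum to 1
# Stand-in prediction vectors (two observations per model, for illustration)
preds <- rbind(base = c(4.1, 3.8), caret = c(4.0, 3.9), final = c(4.2, 3.7))
ensemble <- as.numeric(w %*% preds)         # weighted average per observation
round(w, 3)
```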
The data processing, exploratory data analysis (EDA), and model development phases provided valuable insights into the factors influencing Average User Rating and revealed key challenges in modelling this variable.
The target variable exhibited a significant skew, with most ratings between 4.0 and 4.5. This imbalance made it challenging to accurately predict lower or extreme ratings, as these were underrepresented in the training dataset. The EDA offered descriptive insights but did not identify variables directly impacting the target due to the complex and non-linear relationships between predictors and ratings. These complexities highlighted the necessity of advanced machine learning techniques, such as XGBoost, to uncover subtle interactions and dependencies.
Three XGBoost models were developed, each progressively tuned to improve predictive performance. While these models successfully captured overall trends, they struggled with exact value predictions, particularly in the lower quartile, due to the skewed data distribution and the inherent difficulty of predicting rounded continuous variables. The tuning enhanced metrics, such as RMSE, MAE, and MAPE, showcasing the value of hyperparameter optimisation for model robustness. However, R-squared remained a less sensitive metric in this context due to the high variability and non-linearity in the data.
The performance evaluation on the test dataset revealed a slight decline in performance compared to the validation set, but the metrics remained within acceptable ranges. This difference reflects the models’ ability to generalise effectively to unseen data, capturing overall patterns with minimal loss of accuracy. The performance gap is likely due to nuanced differences in the test data not fully captured during training, even with stratified sampling. Minor overfitting to the validation set during tuning may also have contributed.
Using caret for hyperparameter tuning provided a robust framework for optimisation but was computationally intensive. By contrast, the final model tuned with xgb.train() offered a more efficient approach while maintaining comparable predictive performance. Notably, the parameters of the final model were informed by the results of caret tuning, blending thorough optimisation with computational efficiency.
Feature importance analysis revealed that engineered features such as Description Sentiment, Market Presence, and the number of supported Languages significantly influenced predictions. The shift in importance among features across models further emphasised the complexity of interactions within the data.
Future efforts should address challenges posed by the skewed distribution through strategies such as resampling, sample weighting, or algorithm-specific adjustments. Additional cross-validation techniques and automated hyperparameter optimisation may further enhance model robustness. Despite these limitations, the models demonstrated a strong ability to generalise trends and provided actionable insights into the drivers of user satisfaction, offering a solid foundation for further predictive and exploratory analyses.
Cabello-Solorzano, K., Ortigosa de Araujo, I., Peña, M., Correia, L. and Tallón-Ballesteros, A.J. (2023) ‘The Impact of Data Normalization on the Accuracy of Machine Learning Algorithms: A Comparative Analysis’, 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023). Lecture Notes in Networks and Systems, vol. 750. Cham: Springer. Available at: https://doi.org/10.1007/978-3-031-42536-3_33.
Conort, X. (2013) ‘Q&A with Xavier Conort’. Interviewed by Kaggle Team for Kaggle, 10 April. Available at: https://web.archive.org/web/20190609154949/http://blog.kaggle.com/2013/04/10/qa-with-xavier-conort/ (Accessed: 4 December 2024).
Gupta, A., Stead, T.S. and Ganti, L. (2024) ‘Determining a Meaningful R-squared Value in Clinical Medicine’, Academic Medicine & Surgery. Available at: https://doi.org/10.62186/001c.125154.
Lantz, B. (2023) Machine Learning with R. 4th edn. Packt Publishing Ltd. Available at: https://learning.oreilly.com/library/view/machine-learning-with/9781801071321/ (Accessed: 4 December 2024).
# This command provides detailed information about the current R session, including the version of R, the operating system, and the list of loaded packages along with their versions. It is useful for checking compatibility of packages and diagnosing issues with installed software.
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_7.0-1 lattice_0.22-6 Metrics_0.1.4 tictoc_1.2.1
[5] xgboost_1.7.11.1 reshape2_1.4.4 GGally_2.2.1 ggridges_0.5.6
[9] syuzhet_1.0.7 textrecipes_1.1.0 recipes_1.3.1 rsample_1.3.0
[13] scales_1.4.0 broom_1.0.8 knitr_1.50 kableExtra_1.4.0
[17] ggthemes_5.1.0 RColorBrewer_1.1-3 plotly_4.10.4 lubridate_1.9.4
[21] forcats_1.0.0 stringr_1.5.1 purrr_1.0.4 readr_2.1.5
[25] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.2 tidyverse_2.0.0
[29] dplyr_1.1.4
loaded via a namespace (and not attached):
[1] pROC_1.18.5 rlang_1.1.6 magrittr_2.0.3
[4] furrr_0.3.1 compiler_4.5.0 systemfonts_1.2.3
[7] vctrs_0.6.5 xray_0.2 crayon_1.5.3
[10] pkgconfig_2.0.3 fastmap_1.2.0 backports_1.5.0
[13] labeling_0.4.3 rmarkdown_2.29 prodlim_2025.04.28
[16] tzdb_0.5.0 bit_4.6.0 xfun_0.52
[19] cachem_1.1.0 jsonlite_2.0.0 SnowballC_0.7.1
[22] parallel_4.5.0 R6_2.6.1 bslib_0.9.0
[25] stringi_1.8.7 parallelly_1.44.0 rpart_4.1.24
[28] jquerylib_0.1.4 Rcpp_1.0.14 bookdown_0.43
[31] iterators_1.0.14 future.apply_1.11.3 Matrix_1.7-3
[34] splines_4.5.0 nnet_7.3-20 timechange_0.3.0
[37] tidyselect_1.2.1 rstudioapi_0.17.1 yaml_2.3.10
[40] timeDate_4041.110 codetools_0.2-20 listenv_0.9.1
[43] plyr_1.8.9 withr_3.0.2 evaluate_1.0.3
[46] future_1.49.0 survival_3.8-3 ggstats_0.9.0
[49] xml2_1.3.8 pillar_1.10.2 foreach_1.5.2
[52] stats4_4.5.0 generics_0.1.4 vroom_1.6.5
[55] hms_1.1.3 globals_0.18.0 class_7.3-23
[58] glue_1.8.0 lazyeval_0.2.2 tools_4.5.0
[61] tokenizers_0.3.0 data.table_1.17.4 ModelMetrics_1.2.2.2
[64] gower_1.0.2 grid_4.5.0 crosstalk_1.2.1
[67] ipred_0.9-15 nlme_3.1-168 cli_3.6.5
[70] textshaping_1.0.1 viridisLite_0.4.2 svglite_2.2.1
[73] lava_1.8.1 gtable_0.3.6 sass_0.4.10
[76] digest_0.6.37 htmlwidgets_1.6.4 farver_2.1.2
[79] htmltools_0.5.8.1 lifecycle_1.0.4 hardhat_1.4.1
[82] httr_1.4.7 bit64_4.6.0-1 MASS_7.3-65